We propose a novel teacher-student model for semi-supervised multi-organ segmentation. In teacher-student models, data augmentation is usually applied to unlabeled data to regularize consistent training between teacher and student. We start from a key observation: the fixed relative locations and variable sizes of different organs provide distribution information about where a multi-organ CT scan is drawn from. Thus, we treat this anatomical prior as a strong tool to guide data augmentation and reduce the mismatch between labeled and unlabeled images for semi-supervised learning. More specifically, we propose a data augmentation strategy based on the partition-and-recovery of $N^3$ cubes across, and within, labeled and unlabeled images. Our strategy encourages unlabeled images to learn organ semantics in relative locations from labeled images (cross-branch) and enhances the learning ability for small organs (within-branch). For the within-branch, we further propose to refine the quality of pseudo labels by blending the learned representations from small cubes to incorporate local attributes. Our method is termed MagicNet, since it treats the CT volume as a magic cube and the $N^3$-cube partition-and-recovery process matches the rules of playing a magic cube. Extensive experiments on two public CT multi-organ datasets demonstrate the effectiveness of MagicNet, which noticeably outperforms state-of-the-art semi-supervised medical image segmentation approaches, with a +7% DSC improvement on the MACT dataset with 10% labeled images.
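To make the partition-and-recovery idea concrete, here is a minimal NumPy sketch of the cross-branch mixing. It is an illustration of the strategy under assumed details (a per-cube swap probability, cubic inputs divisible by N), not the authors' implementation:

```python
# Illustrative sketch of N^3 partition-and-recovery augmentation
# (names and details are assumptions, not MagicNet's code).
import numpy as np

def partition(volume, n):
    """Split a cubic volume (D, D, D) into n^3 equally sized small cubes."""
    d = volume.shape[0] // n
    return [volume[i*d:(i+1)*d, j*d:(j+1)*d, k*d:(k+1)*d]
            for i in range(n) for j in range(n) for k in range(n)]

def recover(cubes, n):
    """Reassemble n^3 small cubes back into one volume (inverse of partition)."""
    d = cubes[0].shape[0]
    volume = np.empty((n*d, n*d, n*d), dtype=cubes[0].dtype)
    idx = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                volume[i*d:(i+1)*d, j*d:(j+1)*d, k*d:(k+1)*d] = cubes[idx]
                idx += 1
    return volume

def cross_branch_mix(labeled, unlabeled, n=3, p=0.5):
    """Swap position-matched cubes between a labeled and an unlabeled scan,
    so unlabeled data inherits organ semantics at fixed relative locations."""
    lab_cubes, unlab_cubes = partition(labeled, n), partition(unlabeled, n)
    mask = np.random.rand(n**3) < p
    mixed = [l if m else u for l, u, m in zip(lab_cubes, unlab_cubes, mask)]
    return recover(mixed, n)
```

Applying the same cube mask to the corresponding label/pseudo-label volumes keeps images and annotations aligned.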
Recent advances in artificial intelligence (AI) have significantly intensified research in the geoscience and remote sensing (RS) field. AI algorithms, especially deep learning-based ones, have been developed and applied widely to RS data analysis. The successful application of AI covers almost all aspects of Earth observation (EO) missions, from low-level vision tasks like super-resolution, denoising, and inpainting, to high-level vision tasks like scene classification, object detection, and semantic segmentation. While AI techniques enable researchers to observe and understand the Earth more accurately, the vulnerability and uncertainty of AI models deserve further attention, considering that many geoscience and RS tasks are highly safety-critical. This paper reviews the current development of AI security in the geoscience and RS field, covering the following five important aspects: adversarial attack, backdoor attack, federated learning, uncertainty, and explainability. Moreover, the potential opportunities and trends are discussed to provide insights for future research. To the best of the authors' knowledge, this paper is the first attempt to provide a systematic review of AI security-related research in the geoscience and RS community. Available code and datasets are also listed in the paper to move this vibrant field of research forward.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%), and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based; of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, so adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, the Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer significantly outperforms state-of-the-art models, achieving an average improvement of 17% over the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
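As an illustration of the mechanism, the following PyTorch sketch shows one way a layer could refine the attention scores inherited from the previous layer with a residual convolution before applying them. The module structure, mixing weight, and kernel size are assumptions, not the paper's exact architecture:

```python
# Minimal sketch of evolving attention: the previous layer's attention map is
# updated by a residual 2D convolution (heads act as channels) and mixed with
# the current layer's own scores. Illustrative only.
import torch
import torch.nn as nn

class EvolvingAttention(nn.Module):
    def __init__(self, dim, num_heads, alpha=0.5):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Convolution over attention maps, treating heads as channels.
        self.attn_conv = nn.Conv2d(num_heads, num_heads, kernel_size=3, padding=1)
        self.alpha = alpha  # mixing weight between current and evolved maps

    def forward(self, x, prev_attn=None):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.num_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, T, T)
        if prev_attn is not None:
            # Residual convolutional update of the inherited attention map.
            evolved = prev_attn + self.attn_conv(prev_attn)
            scores = self.alpha * scores + (1 - self.alpha) * evolved
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.proj(out), scores  # scores feed the next layer
```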
Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Though a hopeful route toward general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage are limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% of the performance of 15 task-finetuned models on average, with only 16% of their parameters, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
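To illustrate what a declarative multi-modal instruction might look like, here is a hypothetical one-line example in the spirit described above. The syntax is an assumption for illustration, not necessarily the library's exact interface; consult the repository for the actual API:

```python
# Hypothetical one-line declaration of an image-captioning task: each slot
# names its modality, and the system derives training/inference plans from it.
instruction = "[IMAGE:img] what does the image describe? -> [TEXT:cap]"
```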
Speech representation learning has improved both speech understanding and speech synthesis tasks for a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework in which we randomly mask the spectrogram and the phonemes given a speech example and its transcription. By learning to reconstruct the masked parts of the input in different languages, our model shows great improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both training and inference, without any finetuning effort. Our experiments on cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing confirm these gains. The code and model are publicly available as part of PaddleSpeech.
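A minimal sketch of the joint masking step described above, with masking ratios, the mask token id, and array conventions assumed for illustration:

```python
# Sketch: given a spectrogram and its phoneme transcription, random parts of
# each are masked; the model is then trained to reconstruct the masked parts.
import numpy as np

MASK_ID = 0  # assumed id of the phoneme [MASK] token

def mask_pair(spectrogram, phonemes, frame_ratio=0.3, phone_ratio=0.15):
    """spectrogram: (T, n_mels) float array; phonemes: (L,) int array."""
    spec, phones = spectrogram.copy(), phonemes.copy()
    t_mask = np.random.rand(spec.shape[0]) < frame_ratio
    spec[t_mask] = 0.0                   # zero out masked frames
    p_mask = np.random.rand(len(phones)) < phone_ratio
    phones[p_mask] = MASK_ID             # replace masked phonemes
    return spec, phones, t_mask, p_mask  # masks select reconstruction targets
```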
Spatial autocorrelation and spatial heterogeneity are widespread in spatial data, and they make traditional machine learning models perform poorly. Spatial domain generalization is a spatial extension of domain generalization that can generalize to unseen spatial domains in continuous 2D space. Specifically, it learns a model under varying data distributions that generalizes to unseen domains. Although tremendous success has been achieved in domain generalization, very few works address spatial domain generalization. Progress in this area is challenged by: 1) the difficulty of characterizing spatial heterogeneity, and 2) the difficulty of obtaining predictive models for unseen locations without training data. To address these challenges, this paper proposes a generic framework for spatial domain generalization. Specifically, we develop a spatial interpolation graph neural network that handles spatial data as a graph and learns a spatial embedding for each node together with their relationships. The spatial interpolation graph neural network infers the spatial embedding of an unseen location during the test phase. The spatial embedding of the target location is then used to decode the parameters of the downstream-task model directly for the target location. Finally, extensive experiments on thirteen real-world datasets demonstrate the proposed method's strength.
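The following PyTorch sketch illustrates the two-step test-time idea under assumed details (inverse-distance interpolation over nearest neighbors, a linear downstream model); it is not the authors' implementation:

```python
# Sketch: (1) interpolate a spatial embedding for an unseen location from the
# embeddings of nearby training locations; (2) decode that embedding into the
# weights of a small downstream predictor for the target location.
import torch
import torch.nn as nn

def interpolate_embedding(target_xy, train_xy, train_emb, k=5):
    """Inverse-distance-weighted interpolation of node embeddings."""
    dist = torch.cdist(target_xy[None], train_xy)[0]          # (N,)
    knn = dist.topk(k, largest=False)
    w = 1.0 / (knn.values + 1e-6)
    w = w / w.sum()
    return (w[:, None] * train_emb[knn.indices]).sum(dim=0)   # (d,)

class ParamDecoder(nn.Module):
    """Maps a spatial embedding to the weights of a linear downstream model."""
    def __init__(self, emb_dim, in_features, out_features):
        super().__init__()
        self.in_f, self.out_f = in_features, out_features
        self.decode = nn.Linear(emb_dim, in_features * out_features + out_features)

    def forward(self, emb, x):
        params = self.decode(emb)
        W = params[: self.in_f * self.out_f].view(self.out_f, self.in_f)
        b = params[self.in_f * self.out_f:]
        return x @ W.T + b  # prediction at the target location
```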
Partial observability, in which agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of reinforcement learning (RL). In theory, learning a near-optimal policy under partial observability is statistically hard in the worst case due to exponential sample complexity lower bounds. Recent work has identified several tractable subclasses that are learnable with a polynomial number of samples, such as partially observable Markov decision processes (POMDPs) satisfying certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) a unified structural condition enabling sample-efficient learning is lacking; (2) existing sample complexities for the known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances partially observable RL in all three of these aspects, in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition called \emph{B-stability}. B-stable PSRs include the vast majority of known tractable subclasses, such as weakly revealing POMDPs, low-rank future POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with a number of samples polynomial in the relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved simultaneously by three algorithms: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and model-based Optimistic Posterior Sampling. The latter two are new algorithms for sample-efficient learning of POMDPs/PSRs.
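As background (a standard informal formulation, not taken from this paper), a PSR posits that the probability of any future, conditioned on the history and on executing the future actions, is linear in a low-dimensional predictive state:

$$
\mathbb{P}\big(o_{h:H} \mid \tau_{h-1},\, \mathrm{do}(a_{h:H})\big)
= \big\langle \phi(o_{h:H}, a_{h:H}),\; \psi(\tau_{h-1}) \big\rangle,
\qquad \psi(\tau_{h-1}) \in \mathbb{R}^{d},
$$

where $\tau_{h-1}$ is the history up to step $h-1$ and $d$ is the PSR rank. B-stability, as proposed above, is a condition on such representations; its precise definition is given in the paper.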
Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress toward a unified theory of RL via the DEC framework. First, we propose two new DEC-type complexity measures: the Explorative DEC (EDEC) and the Reward-Free DEC (RFDEC). We show that they are necessary for sample-efficient PAC learning and reward-free learning, respectively, thereby extending the original DEC, which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-to-Decisions (E2D) meta-algorithm with a strong and general model estimator. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021), which require either bounding a variant of the DEC that may be prohibitively large or designing problem-specific estimators. As applications, we recover existing, and obtain new, sample-efficient learning results for a wide range of tractable RL problems using a single algorithm. Finally, as a connection, we reanalyze two existing optimistic model-based algorithms based on posterior sampling or maximum likelihood estimation, showing that they enjoy regret bounds similar to those of E2D-TA under structural conditions similar to the DEC.
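For reference, the original DEC of Foster et al. (2021) takes roughly the following form (a standard-form sketch; the exact formulation and notation in this paper may differ):

$$
\mathrm{dec}_{\gamma}(\mathcal{M}, \widehat{M})
= \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
\mathbb{E}_{\pi \sim p}\!\left[ f^{M}(\pi_{M}) - f^{M}(\pi)
- \gamma \, D_{\mathrm{H}}^{2}\!\big(M(\pi), \widehat{M}(\pi)\big) \right],
$$

where $f^{M}(\pi)$ is the value of policy $\pi$ under model $M$, $\pi_{M}$ is an optimal policy for $M$, $\widehat{M}$ is a reference (estimated) model, and $D_{\mathrm{H}}^{2}$ is the squared Hellinger distance between the trajectory distributions induced by $\pi$ under $M$ and $\widehat{M}$. The EDEC and RFDEC can be understood as variants of this trade-off tailored to PAC and reward-free learning, respectively.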
Multi-object tracking (MOT) is one of the most fundamental computer vision tasks and underpins various video analysis applications. Despite recent promising progress, current MOT research is still limited to a fixed sampling frame rate of the input stream. In fact, we empirically find that the accuracy of all recent state-of-the-art trackers drops dramatically when the input frame rate changes. Toward a more intelligent tracking solution, we shift our research attention to the problem of Frame Rate Agnostic MOT (FraMOT). In this paper, we propose a Frame Rate Agnostic MOT framework with a Periodic training Scheme (FAPS) to address the FraMOT problem for the first time. Specifically, we propose a Frame Rate Agnostic Association Module (FAAM) that infers and encodes frame rate information to aid identity matching across inputs of multiple frame rates, thereby improving the learned model's ability to handle the complex motion relationships in FraMOT. Moreover, the association gap between training and inference is enlarged in FraMOT, because post-processing steps not included in training have a larger impact in lower-frame-rate scenarios. To address this, we propose a Periodic Training Scheme (PTS) to reflect all post-processing steps in training via tracking pattern matching and fusion. Along with the proposed approaches, we make the first attempt to establish an evaluation method for this new task in two different modes, i.e., known frame rate and unknown frame rate, aiming to handle more complex situations. Quantitative experiments on challenging MOT datasets (FraMOT version) clearly demonstrate that the proposed approaches handle different frame rates better and thus improve robustness in complex scenarios.
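As an illustration of encoding frame-rate information into the association step (the general idea behind FAAM, not the authors' module), a PyTorch sketch under assumed details:

```python
# Sketch: the frame-rate value is embedded and fused with track/detection
# features before computing matching affinities, so association can adapt
# to varying input frame rates. Illustrative only.
import torch
import torch.nn as nn

class FrameRateAwareAssociation(nn.Module):
    def __init__(self, feat_dim, rate_dim=32):
        super().__init__()
        self.rate_enc = nn.Sequential(nn.Linear(1, rate_dim), nn.ReLU(),
                                      nn.Linear(rate_dim, feat_dim))
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, track_feats, det_feats, fps):
        # Embed the (possibly varying) frame rate of the input stream.
        r = self.rate_enc(torch.tensor([[float(fps)]]))        # (1, feat_dim)
        t = self.fuse(torch.cat([track_feats, r.expand_as(track_feats)], -1))
        d = self.fuse(torch.cat([det_feats, r.expand_as(det_feats)], -1))
        # Cosine affinity matrix between existing tracks and new detections.
        t = nn.functional.normalize(t, dim=-1)
        d = nn.functional.normalize(d, dim=-1)
        return t @ d.T                                         # (tracks, dets)
```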